Training tree transducers

ABSTRACT

Training of tree transducers is described. Given sample input/output pairs as training, and given a set of tree transducer rules, the information is combined to yield locally optimal weights for those rules. This combination is carried out by building a weighted derivation forest for each input/output pair and applying counting methods to those forests.

This application claims the benefit of the priority of U.S. Provisional Application Ser. No. 60/553,587, filed Mar. 15, 2004 and entitled "TRAINING TREE TRANSDUCERS", the disclosure of which is hereby incorporated by reference.

BACKGROUND

Many different applications are known for tree transducers. These have been used in calculus and other forms of higher mathematics. Tree transducers are used for decidability results in logic, for mathematically modeling the theories of syntax-directed translations and program schemata, syntactic pattern recognition, logic programming, term rewriting, and linguistics.

Within linguistics, automated language monitoring programs often use probabilistic finite state transducers that operate on strings of words. For example, speech recognition may transduce acoustic sequences to word sequences using left-to-right substitution. Probabilistic tree-based models have been used for machine translation, machine summarization, machine paraphrasing, natural language generation, parsing, language modeling, and others.

A special kind of tree transducer, often called an R transducer, operates top-down from the root, with R standing for "root to frontier". At each point within the operation, the transducer chooses a production to apply. That choice is based only on the current state and the current root symbol. The traversal through the transducer continues until no state-annotated nodes remain.

The R transducer relates pairs of trees, T1 and T2, specifying the conditions under which some sequence of productions applied to T1 results in T2. This is similar to what is done by a finite state transducer.

For example, if a finite state transition from state q to state r eats symbol A and outputs symbol B, then this can be written as an R production q(A x0) → B(r x0).

The R transducer may also copy whole trees, transform subtrees, delete subtrees, and perform other operations.

SUMMARY

The present application teaches a technique of training tree transducers from sample input/output pairs. A first embodiment trains the tree transducers based on tree pairs, while a second embodiment trains the tree transducers based on tree/string pairs. Techniques are described that facilitate the computation and simplify the information as part of the training process.

An embodiment is described which uses these techniques to train transducers for statistically based language processing: e.g., language recognition and/or language generation. However, it should be understood that this embodiment is merely exemplary, and that other applications for the training of tree transducers are contemplated.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects will now be described in detail with reference to the accompanying drawings, wherein:

FIG. 1 shows derivation trees and their simplifications;

FIG. 2A shows a flowchart;

FIG. 2B shows a speech engine that can execute the flowchart of FIG. 2A;

FIG. 2C shows a flowchart of a second embodiment; and

FIGS. 3 and 4 show a model and parameter table.

DETAILED DESCRIPTION

The present application describes training of tree transducers, e.g., probabilistic R transducers. These transducers may be used for any probabilistic purpose. In an embodiment, the trained transducers are used for linguistic operations, such as machine translation, paraphrasing, text compression and the like. Training data may be obtained in the form of tree pairs. Linguistic knowledge is automatically distilled from those tree pairs and from transducer information.

T_Σ represents the set of trees over the alphabet Σ. An alphabet is a finite set of symbols. Trees may also be written as strings over the set Σ.
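For illustration only, such trees can be held in a small recursive structure. The following Python sketch (the class and method names are ours, not from the embodiment) encodes a tree over Σ and writes it back out as a string:

```python
# A minimal sketch of trees over an alphabet; naming is illustrative.

class Tree:
    def __init__(self, label, children=()):
        self.label = label            # a symbol from the alphabet sigma
        self.children = tuple(children)

    def __str__(self):
        """Write the tree as a string over the alphabet, e.g. S(NP,VP)."""
        if not self.children:
            return self.label
        return f"{self.label}({','.join(str(c) for c in self.children)})"

# Example: the tree S(NP, VP(VB(run)))
t = Tree("S", [Tree("NP"), Tree("VP", [Tree("VB", [Tree("run")])])])
print(t)  # S(NP,VP(VB(run)))
```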

A regular tree grammar, or RTG, allows compactly representing a potentially infinite set of trees. A weighted regular tree grammar additionally associates a weight with each tree in the set. The grammar can be described as a quadruple G = (Σ, N, S, P), where Σ is the alphabet, N is the set of nonterminals, S is the starting (initial) nonterminal, and P is the set of weighted productions. The productions are written left to right. A weighted RTG can accept information from an infinite number of trees. More generally, the weighted RTG can be any list which includes information about the trees in a tree grammar, in a way that allows a weight to change, rather than adding a new entry, each time the same information is reobtained.

The RTG can take the following form:

TABLE I
  Σ = {S, NP, VP, PP, PREP, DET, N, V, run, the, of, sons, daughters}
  N = {q, qnp, qpp, qdet, qn, qprep}
  S = q
  P = {q →^(1.0) S(qnp, VP(VB(run))),
       qnp →^(0.6) NP(qdet, qn),
       qnp →^(0.4) NP(qnp, qpp),
       qpp →^(1.0) PP(qprep, qnp),
       qdet →^(1.0) DET(the),
       qprep →^(1.0) PREP(of),
       qn →^(0.5) N(sons),
       qn →^(0.5) N(daughters)}
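For concreteness, the weighted RTG of Table I can be held as plain data, each nonterminal mapping to its weighted right-hand sides. The sketch below (here trees are plain nested tuples; all names are ours) is illustrative only; note that the recursive qnp production lets the grammar derive infinitely many trees:

```python
import random

# The wRTG of Table I: nonterminal -> [(rhs, weight)]; an rhs is a
# nested tuple mixing alphabet symbols and nonterminal names.
wrtg = {
    "q":     [(("S", "qnp", ("VP", ("VB", "run"))), 1.0)],
    "qnp":   [(("NP", "qdet", "qn"), 0.6),
              (("NP", "qnp", "qpp"), 0.4)],
    "qpp":   [(("PP", "qprep", "qnp"), 1.0)],
    "qdet":  [(("DET", "the"), 1.0)],
    "qprep": [(("PREP", "of"), 1.0)],
    "qn":    [(("N", "sons"), 0.5), (("N", "daughters"), 0.5)],
}
start = "q"

def derive(nt):
    """Randomly expand a nonterminal into a tree, following the weights."""
    rhss, weights = zip(*wrtg[nt])
    return expand(random.choices(rhss, weights=weights)[0])

def expand(sym):
    if isinstance(sym, tuple):        # alphabet symbol with children
        return (sym[0],) + tuple(expand(s) for s in sym[1:])
    if sym in wrtg:                   # nonterminal: keep deriving
        return derive(sym)
    return sym                        # terminal leaf

print(derive(start))  # e.g. ('S', ('NP', ('DET', 'the'), ('N', 'sons')), ...)
```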

The tree is parsed from left to right, so that the leftmost nonterminal is the next one to be expanded as the next item in the RTG. The leftmost derivations of G build a tree pre-order from left to right, according to

$LD(G) \equiv \{(t, ((p_{1}, r_{1}), \ldots, (p_{n}, r_{n}))) \in D_{G} \mid \forall\, 1 \le i < n : p_{i+1} \nless_{lex} p_{i}\}$

The total weight of t in G is given by $W_{G} : T_{\Sigma} \rightarrow \mathbb{R}$, the sum of the weights of the leftmost derivations producing t:

$W_{G}(t) \equiv \sum_{(t,h) \in LD(G)} \prod_{i=1}^{n} w_{i} \quad \text{where}\ h = (h_{1}, \ldots, h_{n})\ \text{and}\ h_{i} = (p_{i}, (l_{i}, r_{i}, w_{i}))$

Therefore, for every weighted context-free grammar, there is an equivalent weighted RTG that produces its weighted derivation trees. Weighted RTGs generate exactly the recognizable tree languages.

An extended transducer, xR, is also used herein. According to this extended transducer, an input subtree matching a pattern in state q is converted into the rule's right-hand side ("rhs"), and the rhs's state-labeled paths are replaced by their recursive transformations. The right-hand side of a rule may have no states for further expansion (a terminal rule) or may have states for further expansion. In notation form:

$\Rightarrow_{X} \equiv \{((a,h),\,(b,\,h \cdot (i,(q,\mathrm{pattern},\mathrm{rhs},w)))) \mid (q,\mathrm{pattern},\mathrm{rhs},w) \in R \,\wedge\, i \in \mathrm{paths}_{a} \,\wedge\, q = \mathrm{label}_{a}(i) \,\wedge\, \mathrm{pattern}(a\downarrow(i \cdot (1))) = 1 \,\wedge\, b = a[i \leftarrow \mathrm{rhs}[p \leftarrow q'(a\downarrow(i \cdot (1) \cdot i')),\ \forall p \in \mathrm{paths}_{\mathrm{rhs}} : \mathrm{label}_{\mathrm{rhs}}(p) = (q', i')]]\}$

where b is derived from a by application of a rule (q, pattern) → rhs to an unprocessed input subtree a↓i which is in state q. Its output is replaced by the output given by the rhs, and the rhs's nonterminals are replaced by instructions to transform descendant input subtrees.
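As a hedged illustration of one such derivation step, the sketch below applies a single xR rule to a state-labeled input subtree. The tuple encoding, the XFORM marker standing in for the (state, input-path) labels, and all names are ours, not the patent's:

```python
# Sketch of one xR step: a state-labeled node q(subtree) is rewritten
# by a matching rule's rhs; each (state, input-path) marker in the rhs
# becomes an instruction to transform the corresponding input subtree.

def subtree_at(tree, path):
    """Follow a path of child indices down an input tree (nested tuples)."""
    for i in path:
        tree = tree[1 + i]
    return tree

def apply_rule(state, input_tree, rule):
    q, pattern, rhs, w = rule
    assert state == q and pattern(input_tree)        # rule must match here
    def fill(node):
        if isinstance(node, tuple) and node and node[0] == "XFORM":
            _, q2, in_path = node                    # (q', i') marker
            return (q2, subtree_at(input_tree, in_path))
        if isinstance(node, tuple):
            return (node[0],) + tuple(fill(c) for c in node[1:])
        return node
    return fill(rhs), w

# Example rule: q A(x0, x1) -> B(q2 x1, q2 x0), weight 0.5
rule = ("q", lambda t: t[0] == "A",
        ("B", ("XFORM", "q2", (1,)), ("XFORM", "q2", (0,))), 0.5)
out, w = apply_rule("q", ("A", ("C",), ("D",)), rule)
print(out)  # ('B', ('q2', ('D',)), ('q2', ('C',)))
```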

The sources of a rule r = (q, l, rhs, w) ∈ R are the input paths in the rhs:

$\mathrm{sources}(\mathrm{rhs}) \equiv \{i' \mid \exists p \in \mathrm{paths}_{\mathrm{rhs}}(Q \times \mathrm{paths}),\ q' \in Q : \mathrm{label}_{\mathrm{rhs}}(p) = (q', i')\}$

The reflexive, transitive closure of $\Rightarrow_{X}$ is written $\Rightarrow^{*}_{X}$, and the derivations of X, written D(X), are the ways of transforming input tree I (with its root in the initial state) to an output tree O:

$D(X) \equiv \{(I, O, h) \in T_{\Sigma} \times T_{\Delta} \times (\mathrm{paths} \times R)^{*} \mid (Q_{i}(I), ()) \Rightarrow^{*}_{X} (O, h)\}$

The leftmost derivations of X transform the tree pre-order from left to right (always applying a transformation rule to the state-labeled subtree furthest left in its string representation):

$LD(X) \equiv \{(I, O, ((p_{1}, r_{1}), \ldots, (p_{n}, r_{n}))) \in D(X) \mid \forall\, 1 \le i < n : p_{i+1} \nless_{lex} p_{i}\}$

The total weight of (I, O) in X is given by $W_{X} : T_{\Sigma} \times T_{\Delta} \rightarrow \mathbb{R}$, the sum of the leftmost derivations transforming I to O:

$W_{X}(I, O) \equiv \sum_{(I,O,h) \in LD(X)} \prod_{i=1}^{n} w_{i} \quad \text{where}\ h = (h_{1}, \ldots, h_{n})\ \text{and}\ h_{i} = (p_{i}, (l_{i}, r_{i}, w_{i}))$

The tree transducers operate by starting at an initial state root and recursively applying output-generating rules until no states remain, so that there is a complete derivation. In this way, the information (trees and transducer information) can be converted to a derivation forest, stored as a weighted RTG.

The overall operation is illustrated in the flowchart of FIG. 2A; FIG. 2B illustrates an exemplary hardware device which may execute that flowchart. For the application of language translation, a processing module 250 receives data from various sources 255. The sources may be the input and output trees and transducer rules described herein. Specifically, these may be translation memories, dictionaries, glossaries, the Internet, and human-created translations. The processor 250 processes this information as described herein to produce translation parameters, which are output as 260. The translation parameters are used by language engine 265 in making translations based on input language 270. In the disclosed embodiment, the speech engine is a language translator which translates from a first language to a second language. Alternatively, however, the speech engine can be any engine that operates on strings of words, such as a language recognition device, a speech recognition device, a machine paraphraser, a natural language generator, a modeler, or the like.

The processor 250 and speech engine 265 may be any general purpose computer, and can be effected by a microprocessor, a digital signal processor, or any other processing device that is capable of executing the steps described herein.

The flowchart described herein can be instructions which are embodied on a machine-readable medium such as a disc or the like. Alternatively, the flowchart can be executed by dedicated hardware, or by any known or later-discovered processing device.

The system obtains a plurality of input and output trees or strings, and transducer rules with parameters. The parameters may then be used for statistical machine translation. More generally, however, the parameters can be used for any tree transformation task.

At 210, the input tree, output tree and transducer rules are converted to a large set of individual derivation trees, "a derivation forest".

The derivation forest effectively flattens the rules into trees of depth one. The root is labeled by the original rule. All the non-expanding, Δ-labeled nodes of the rule are deterministically listed in order. The weights of the derivation trees are the products of the weights of the rules in those derivation trees.
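A minimal sketch of this weight computation, assuming derivation trees encoded as (rule name, children) pairs and an illustrative rule-weight table:

```python
# Sketch: a derivation tree is a tree of rule applications (depth one
# per rule); its weight is the product of the weights of its rules.

def derivation_weight(deriv, rule_weights):
    root, children = deriv
    w = rule_weights[root]
    for child in children:
        w *= derivation_weight(child, rule_weights)
    return w

rule_weights = {"rule1": 1.0, "rule2": 1.0, "rule3": 0.6, "rule4": 0.4}
deriv = ("rule1", [("rule2", []), ("rule3", [])])
print(derivation_weight(deriv, rule_weights))  # 0.6
```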

FIG. 1 illustrates an input tree 100 being converted to an output tree 110, generating derivation trees 130. FIG. 1 also shows the transducer rules 120. All of these are inputs to the system; specifically, the input and output trees are the data obtained from the various language translation resources 255, for example. The transducer rules are known. The object of the parsing carried out in FIG. 1 is to derive the derivation trees 130 automatically.

The input/output tree pairs are used to produce a probability estimate for each production in P that maximizes the probability of the output trees given the input trees. The result is a local maximum. The present system uses simplifications to find this maximum.

The technique uses memoization when creating the weighted RTGs. Memoization means that the possible derivations for a given combination, once produced, are constant. This prevents certain combinations from being computed more than once: the table, here the wRTG, can store the answers for all past queries and return those instead of recomputing them.
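A minimal memoization sketch in the same spirit: the cache (here Python's lru_cache; in the embodiment, the wRTG table itself) stores the answer for each queried (state, input subtree, output subtree) combination so it is never recomputed. The toy copying transducer below is illustrative only:

```python
import functools

# Toy: state "q" copies X(a, b) -> X(q a, q b); we count derivations.
@functools.lru_cache(maxsize=None)
def num_derivations(state, itree, otree):
    # Each (state, input subtree, output subtree) combination is
    # computed at most once; later queries hit the cache.
    if len(itree) == 1 and len(otree) == 1:            # leaves
        return 1 if itree == otree else 0
    if itree[0] != otree[0] or len(itree) != len(otree):
        return 0
    n = 1
    for ic, oc in zip(itree[1:], otree[1:]):
        n *= num_derivations(state, ic, oc)
    return n

t = ("X", ("a",), ("b",))
print(num_derivations("q", t, t))   # 1
```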

Note the way in which the derivation trees are converted to weighted RTGs. At the start, rule one will always be applied, so the first RTG production represents a 1.0 probability of rule one being applied. The arguments of rule one are 1.12 and 2.11. If 1.12 is applied, rule 2 is always used, while 2.11 can be either rule 3 or rule 4, with the different weightings for the different rules also being shown.

At 230, the weighted RTG is further processed to sum the weights of the derivation trees. This can use the "inside-outside" technique (Lari and Young, "The estimation of stochastic context-free grammars using the inside-outside algorithm," Computer Speech and Language, 4, pp. 35-56). The inside-outside technique observes counts each time a rule gets used; when a rule gets used, the probability of that rule is increased. More specifically, given a weighted RTG with parameters, the inside-outside technique enables computing the sums of weights of the trees derived using each production. Inside weights are the sum of all weights that can be derived for a nonterminal or production; this is a recursive definition. The inside weight of a production's right-hand side is the product of the inside weights of the nonterminals it contains:

$\beta_{G}(n \in N) \equiv \sum_{(n,r,w) \in P} w \cdot \beta_{G}(r)$

$\beta_{G}(r \in T_{\Sigma}(N) \mid (n,r,w) \in P) \equiv \prod_{p \in \mathrm{paths}_{r}(N)} \beta_{G}(\mathrm{label}_{r}(p))$

The outside weights for a nonterminal are the sum of weights of trees generated by the weighted RTG that have derivations containing it but exclude its inside weights, according to

$\alpha_{G}(n \in N) \equiv \begin{cases} 1 & \text{if}\ n = S \\ \sum\limits_{p,\,(n',r,w) \in P\,:\,\mathrm{label}_{r}(p) = n} w \cdot \alpha_{G}(n') \cdot \prod\limits_{p' \in \mathrm{paths}_{r}(N) - \{p\}} \beta_{G}(\mathrm{label}_{r}(p')) & \text{otherwise} \end{cases}$

where the sum ranges over the uses of n in productions, and the product ranges over the sibling nonterminals of each use.
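A sketch of both computations over an acyclic weighted RTG, the form that derivation forests take. The grammar representation matches the earlier Table I sketch; the code assumes acyclicity and that nonterminals are listed top-down (topologically), and all names are ours:

```python
from collections import defaultdict
from math import prod

def nonterminals_in(rhs, grammar, path=()):
    """Yield (path, nonterminal) for each nonterminal occurring in rhs."""
    if isinstance(rhs, str):
        if rhs in grammar:
            yield path, rhs
    else:
        for i, child in enumerate(rhs[1:]):
            yield from nonterminals_in(child, grammar, path + (i,))

def inside_weights(grammar):
    """beta(n): sum over productions of w times the product of child betas."""
    beta = {}
    def b(n):
        if n not in beta:
            beta[n] = sum(
                w * prod(b(nt) for _, nt in nonterminals_in(rhs, grammar))
                for rhs, w in grammar[n])
        return beta[n]
    for n in grammar:
        b(n)
    return beta

def outside_weights(grammar, start, beta):
    """alpha(n): context weight; assumes top-down (topological) listing."""
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for n in grammar:
        for rhs, w in grammar[n]:
            uses = list(nonterminals_in(rhs, grammar))
            for p, nt in uses:                     # each use of nt in rhs
                siblings = prod(beta[m] for p2, m in uses if p2 != p)
                alpha[nt] += w * alpha[n] * siblings
    return dict(alpha)
```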

Estimation-maximization training is then carried out at 240. This maximizes the expectation of decisions taken for all possible ways of generating the training corpus, alternating between two steps:

1. Estimating, by computing the expected count of each parameter over the derivations of each training example:

$\forall p \in \mathrm{parameters} : \mathrm{counts}_{p} \equiv E_{t \in \mathrm{training}}\!\left[\frac{\sum_{d \in \mathrm{derivations}_{t}} (\#\ \text{of times}\ p\ \text{used in}\ d) \cdot \mathrm{weight}_{\mathrm{parameters}}(d)}{\sum_{d \in \mathrm{derivations}_{t}} \mathrm{weight}_{\mathrm{parameters}}(d)}\right]$

2. Maximizing, by assigning the counts to the parameters and renormalizing:

$\forall p \in \mathrm{parameters} : p \leftarrow \frac{\mathrm{counts}_{p}}{Z(p)}$

Each iteration increases the likelihood until a local maximum is reached.

The step 230 can be written in pseudocode as:

For each (i, o, w_example) ∈ T:  // Estimate
  i.   Let D ≡ d_{i,o}
  ii.  Compute α_D, β_D using the latest W  // inside-outside weights
  iii. For each prod = (n, rhs, w) ∈ P, with label_rhs(()) ∈ R, in the derivation wRTG D = (R, N, S, P):
       A. γ_D(prod) ← α_D(n) · w · β_D(rhs)
       B. Let rule ≡ label_rhs(())
       C. count_rule ← count_rule + w_example · γ_D(prod) / β_D(S)
  iv.  L ← L + log β_D(S) · w_example
For each r = (q, pattern, rhs, w) ∈ R:  // Maximize
  i.   w_r ← count_r / Z(counts, r)
δ ← (L − lastL) / L
lastL ← L, itno ← itno + 1

By using the weighted RTGs, each estimation-maximization iteration takes an amount of time that is linear in the size of the transducer. For example, this may compute the sum of all the counts for rules having the same state, to provide model weights for a joint probability distribution of the input/output tree pairs. This joint normalization may avoid many different problems.
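A hedged Python rendering of one estimation-maximization iteration, reusing the inside/outside sketches above. The helpers derivation_forest (returning a forest grammar, its start symbol, and a map from forest productions back to transducer rules) and the per-state normalizer Z are assumed for illustration, not given by the source:

```python
import math
from collections import defaultdict
from math import prod

def em_iteration(examples, rule_weight):
    """One Estimate/Maximize pass; returns the corpus log-likelihood L."""
    counts = defaultdict(float)
    L = 0.0
    for i_tree, o_tree, w_example in examples:             # Estimate
        grammar, start, rule_of = derivation_forest(i_tree, o_tree)  # assumed
        beta = inside_weights(grammar)
        alpha = outside_weights(grammar, start, beta)
        for n in grammar:
            for rhs, w in grammar[n]:
                gamma = alpha[n] * w * prod(
                    beta[nt] for _, nt in nonterminals_in(rhs, grammar))
                counts[rule_of[(n, rhs)]] += w_example * gamma / beta[start]
        L += math.log(beta[start]) * w_example
    for r in rule_weight:                                  # Maximize
        rule_weight[r] = counts[r] / Z(counts, r)          # assumed normalizer
    return L
```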

The above has described tree-to-tree transducers. An alternative embodiment, which trains tree-to-string transducers, is shown in the flowchart of FIG. 2C. This transducer is used when a tree is only available at the input side of the training corpus. Note that FIG. 2C is substantially identical to FIG. 2A other than the form of the input data.

The tree-to-string transduction is then parsed using an extended R transducer, as in the first embodiment, to form a weighted derivation tree grammar. The derivation trees are formed by converting the input tree and the string into a flattened string of information which may include trees and strings; 285 of FIG. 2C simply refers to this as derivation information. The parsing of the tree-to-string transduction may be slightly different than the tree-to-tree transduction: instead of derivation trees, there may be output string spans, and a less constrained alignment may result.

This is followed in FIG. 2C by operations that are analogous to those in FIG. 2A: specifically, creation of the weighted RTG, the same weight summing as at 230, and the expectation maximization of 240.

EXAMPLE

An example is now described herein of how to cast a probabilistic language model as an R transducer.

Table 2 shows a bilingual English-tree/Japanese-string training corpus.

TABLE 2
  ENGLISH:  (VB (NN hypocrisy) (VB is) (JJ (JJ abhorrent) (TO (TO to) (PRP them))))
  JAPANESE: kare ha gizen ga daikirai da
  ENGLISH:  (VB (PRP he) (VB has) (NN (JJ unusual) (NN ability)) (IN (IN in) (NN english)))
  JAPANESE: kare ha eigo ni zubanuke-ta sainou wo mot-te iru
  ENGLISH:  (VB (PRP he) (VB was) (JJ (JJ ablaze) (IN (IN with) (NN anger))))
  JAPANESE: kare ha mak-ka ni nat-te okot-te i-ta
  ENGLISH:  (VB (PRP i) (VB abominate) (NN snakes))
  JAPANESE: hebi ga daikirai da
  etc.
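For concreteness, such a corpus can be held as (English tree, Japanese string) pairs; nested tuples and word lists are an illustrative encoding of the first and last entries of Table 2:

```python
# Two entries of the bilingual corpus of Table 2 as (tree, string) pairs.
corpus = [
    (("VB", ("NN", "hypocrisy"), ("VB", "is"),
      ("JJ", ("JJ", "abhorrent"), ("TO", ("TO", "to"), ("PRP", "them")))),
     "kare ha gizen ga daikirai da".split()),
    (("VB", ("PRP", "i"), ("VB", "abominate"), ("NN", "snakes")),
     "hebi ga daikirai da".split()),
]
```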

FIGS. 3 and 4 respectively show the generative model and its parameters. The parameter values that are shown are learned via expectation-maximization techniques as described in Yamada and Knight 2001.

According to the model, an English tree becomes a Japanese string in four operations. FIG. 3 shows how the channel input is first reordered; that is, its children are permuted probabilistically. If there are three children, then there are six possible permutations, whose probabilities add up to one. The reordering depends only on the child label sequence.

In 320, a decision is made at every node about inserting a Japanese function word. This is a three-way decision at each node: whether the word should be inserted to the left, to the right, or not at all. The insertion decision at 320 depends on the labels of the node and its parent. At 330, the English leaf words are translated probabilistically into Japanese, independent of context. At 340, the internal nodes are removed, leaving only the Japanese string.
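A toy sketch of this four-operation channel, with illustrative stand-ins for the learned reorder (r), insert (n), and translate (t) tables; the real parameter values are those of FIG. 4:

```python
import random

# Toy parameter tables, illustrative only.
r_table = {("PRP", "VB", "NN"): [((2, 0, 1), 0.7), ((0, 1, 2), 0.3)]}
n_table = {("TOP", "VB"): [(("right", "da"), 0.6), (("none", None), 0.4)],
           ("VB", "PRP"): [(("right", "ha"), 0.9), (("none", None), 0.1)],
           ("VB", "VB"):  [(("none", None), 1.0)],
           ("VB", "NN"):  [(("none", None), 1.0)]}
t_table = {"he": ["kare"], "abominate": ["daikirai"], "snakes": ["hebi"]}

def pick(pairs):
    items, weights = zip(*pairs)
    return random.choices(items, weights=weights)[0]

def channel(node, parent="TOP"):
    label, kids = node[0], node[1:]
    if len(kids) == 1 and isinstance(kids[0], str):       # preterminal leaf
        out = [random.choice(t_table[kids[0]])]           # 3. translate word
    else:
        order = pick(r_table[tuple(k[0] for k in kids)])  # 1. reorder children
        out = [w for i in order for w in channel(kids[i], label)]
    choice, word = pick(n_table[(parent, label)])         # 2. insert function word
    if choice == "left":
        out = [word] + out
    if choice == "right":
        out = out + [word]
    return out                                            # 4. nodes removed

eng = ("VB", ("PRP", "he"), ("VB", "abominate"), ("NN", "snakes"))
print(" ".join(channel(eng)))   # e.g. "hebi kare ha daikirai da"
```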

This model can effectively provide a formula for P(Japanese string | English tree) in terms of individual parameters. The expectation-maximization training described herein seeks to maximize the product of these conditional probabilities based on the entire tree-string corpus.

First, an xRs tree-to-string transducer is built that embodies the probabilities noted above. It is a four-state transducer. For the main start state, the function q, meaning "translate this tree," has three productions:

- q x → i x, r x
- q x → r x, i x
- q x → r x

State i means "produce a Japanese word out of thin air." There is an i production for each Japanese word in the vocabulary:

- i x → "de"
- i x → "kuruma"
- i x → "wa"
- . . .

State r means "reorder my children and then recurse." For internal nodes, this includes a production for each parent/child sequence, and every permutation thereof:

- r NN(x0:CD, x1:NN) → q x0, q x1
- r NN(x0:CD, x1:NN) → q x1, q x0
- . . .

The RHS then sends the child subtrees back to state q for recursive processing. For English leaf nodes, the process instead transitions to a different state t, to prohibit any subsequent Japanese function word insertion:

- r NN(x0:"car") → t x0
- r CC(x0:"and") → t x0
- . . .

State t means "translate this word." There is a production for each pair of co-occurring English and Japanese words:

- t "car" → "kuruma"
- t "car" → *e*
- . . .

Each production in the xRs transducer has an associated weight, and corresponds to exactly one of the model parameters.

The transducer is unfaithful in one respect: the insert-function-word decision is independent of context, whereas it should depend on the node label and the parent label. This is addressed by fixing the q and r productions. Start productions are used:

- q x:VB → q.TOP.VB x
- q x:JJ → q.TOP.JJ x
- . . .

States such as q.TOP.VB are used, which mean something like "translate this tree, whose root is VB." Every parent-child pair in the corpus gets its own set of insert-function-word productions:

- q.TOP.VB x → i x, r x
- q.TOP.VB x → r x, i x
- q.TOP.VB x → r x
- q.VB.NN x → i x, r x
- q.VB.NN x → r x, i x
- q.VB.NN x → r x
- . . .
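Such parent/child-conditioned productions can be generated mechanically; the sketch below uses a toy label set and an illustrative production encoding, whereas the embodiment creates productions only for the parent-child pairs seen in the corpus:

```python
# Generate the three insert-function-word productions for each
# q.parent.child state; labels and encoding are illustrative.
labels = ["TOP", "VB", "NN", "JJ", "PRP", "IN", "TO"]
productions = []
for parent in labels:
    for child in labels:
        state = f"q.{parent}.{child}"
        productions.append((state, "x", [("i", "x"), ("r", "x")]))  # i x, r x
        productions.append((state, "x", [("r", "x"), ("i", "x")]))  # r x, i x
        productions.append((state, "x", [("r", "x")]))              # r x
```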

Finally, the r productions need to send parent-child information when they recurse to the q.parent.child states.

The other productions stay the same. Productions for phrasal translations and others can also be added.

Although only a few embodiments have been disclosed in detail above, other modifications are possible, and this disclosure is intended to cover all such modifications, and most particularly any modification which might be predictable to a person having ordinary skill in the art. For example, an alternative embodiment could use the same techniques for string-to-string training, based on tree-based models or based only on string pair data. Another application is to generate likely input trees from output trees, or vice versa. Also, to reiterate the above, many other applications can be carried out with tree transducers, and the application of tree transducers to linguistic issues is merely exemplary.

Also, only those claims which use the words "means for" are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.

All such modifications are intended to be encompassed within the following claims.

1. A method, comprising: obtaining tree transducer information including input/output pair information and transducer information; converting said input/output pair information and said transducer information into a set of values in a weighted tree grammar; and using said weighted tree grammar to solve a problem that requires information from the input/output pair information and transducer information.

2. A method as in claim 1, wherein said using comprises training a linguistic engine which solves a linguistic problem.

3. A method as in claim 2, wherein said linguistic problem includes training of a linguistic engine of a type that converts from one language to another.

4. A method as in claim 1, wherein said tree transducer information includes an input tree and an output tree as said input/output pair.

5. A method as in claim 1, wherein said tree transducer information includes an input tree and an output string as said input/output pair.

6. A method as in claim 1, further comprising converting said tree transducer information into a derivation forest.

7. A method as in claim 1, wherein said set of values represents information about the tree transducer information and transducer information in a specified grammar, associated with a weight for each of a plurality of entries.

8. A method as in claim 7, wherein said set of values are in a weighted regular tree grammar.

9. A method as in claim 8, wherein said converting comprises storing the set of values as a weighted regular tree grammar, and returning certain stored information instead of recomputing said certain stored information.

10. A method as in claim 1, further comprising further processing the set of values to sum weights based on information learned from said tree transducer information and said transducer information.

11. A method as in claim 10, wherein said further processing comprises using an inside-outside algorithm to observe counts and determine each time a rule gets used and to adjust said rule weights based on said observing.

12. A method as in claim 10, wherein said further processing comprises observing counts and determining each time a rule gets used, and increasing a probability of that rule each time the rule gets used.

13. A method as in claim 1, wherein said training further comprises maximizing an expectation of decisions.

14. A method as in claim 1, wherein said training further comprises computing a sum of all the counts for rules having the same state, to provide model weights for one of a joint or conditional probability distribution of the tree transducer information.

15. A method as in claim 1, wherein said using comprises solving a logic problem.

16. A method as in claim 15, wherein said logic problem is a problem that uses a machine to analyze an aspect of at least one language.

17. A training apparatus comprising: an input part which includes tree transducer information that includes input samples and output samples, and includes transducer information; a processor that processes said tree transducer information and said transducer information and produces an output table indicative of said tree transducer information and said transducer information, said output table including a set of values, and weights for said set of values.

18. A training apparatus as in claim 17, further comprising a linguistic engine, receiving said output tables, addressing a linguistic problem based on said output tables.

19. A training apparatus as in claim 17, further comprising using an inside-outside algorithm to observe counts and determine each time a rule gets used and to adjust said rule weights based on the observing.

20. A training apparatus as in claim 17, wherein said input part is a tree transducer.

21. A training apparatus as in claim 20, wherein said output part is a tree transducer.

22. A training apparatus as in claim 20, wherein said output part is a string transducer.

23. A training apparatus as in claim 17, wherein said output table stores a set of values as a regular tree grammar that is weighted with weights related to a number of times that each rule is used.

24. A training apparatus as in claim 23, wherein said processor is also operative to adjust said weights.

25. A training apparatus as in claim 17, further comprising a logic engine that analyzes at least one logical problem based on said output table.

26. A computer readable storage medium containing a set of instructions to enable a computer to carry out a function, the set of instructions including instructions to: obtain tree transducer information including at least an input tree, output information corresponding to said input tree, and transducer information; convert said tree transducer information into a weighted set of values indicative of said tree transducer information and said transducer information; and use said set of values to solve a problem.

27. A storage medium as in claim 26, wherein said output information within said tree transducer information includes an output tree.

28. A storage medium as in claim 26, wherein said output information within said tree transducer information includes an output string.

29. A storage medium as in claim 26, wherein said instructions to convert include instructions to maximize an expectation that a rule will be used.

30. A method comprising: using a computer to obtain information in the form of a first tree, second information corresponding to said first tree, and transducer information; and using said computer to automatically distill information from said first tree, from said second information, and said transducer information into a list of information in a specified tree grammar with weights associated with entries in the list, and to produce locally optimal weights for said entries.

31. A method as in claim 30, further comprising using said list of information in said computer to solve a problem.

32. A method as in claim 30, wherein said automatically distill comprises parsing the information from a first portion to a second portion.

33. A method as in claim 32, wherein said second information comprises a second tree.

34. A method as in claim 32, wherein said second information comprises a string.

35. A method as in claim 30, wherein said automatically distill comprises observing when a specified rule in said list of information is used, and increasing said weight associated with said specified rule when said specified rule is used.

36. A method as in claim 31, wherein said problem is a linguistic problem.